62 research outputs found

    Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks

    Get PDF
    Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic postsilicon measurement based analysis are increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications).Peer ReviewedPostprint (author’s final draft

    A High Performance, Energy Efficient GALS ProcessorMicroarchitecture with Reduced Implementation Complexity

    Full text link
    As the costs and challenges of global clock distribution grow with each new microprocessor generation, a Globally Asynchronous, Locally Synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a Multiple Clock Domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains which complicates load scheduling, and the implementation of 32 voltage and frequency levels in each domain. In addition, the hardwarebased control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels that are most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the Reorder Buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement

    Adaptive power shifting for power-constrained heterogeneous systems

    Get PDF
    The number and heterogeneity of compute devices, even within a single compute node, has been steadily on the rise. Since all systems must operate under a power cap, the number of discrete devices that can run simultaneously at their highest frequency is limited by the globally-imposed power cap. Current systems incorporate a centralized power management unit that statically controls the distribution of power among the devices within the node. However, such static distribution policies are unaware of the dynamic utilization profile across the devices, which leads to unfair power allocations that end up degrading system throughput performance. The problem is particularly acute in the presence of heterogeneity since type-specific performance-boost capabilities cannot be leveraged via utilization-agnostic static power allocations. This paper proposes Adaptive Power Shifting for multi-accelerator heterogeneous systems (APS), a technique that leverages system utilization information to dynamically allocate and re-distribute power budgets across multiple discrete devices. Democratizing the power allocation based on dynamic needs results in dramatic speedup over a need-agnostic static allocation. We use APS in a real OpenPOWER compute node with 2 CPUs and 4 GPUs to demonstrate the value of on-demand, equitable power allocations. Overall, the proposed solution increases performance with respect to two state-of-the-art techniques by up to 14.9% and 13.8%.This work has been partially supported by the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877), by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB-C22), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the IBM/BSC Deep Learning Center initiative. Ll. Alvarez has been supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under the Juan de la Cierva Formacion fellowship No. FJCI-2016- 30984. M. Moreto has been supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104.Peer ReviewedPostprint (author's final draft

    Software-controlled priority characterization of POWER5 processor

    Get PDF
    Due to the limitations of instruction-level parallelism, thread-level parallelism has become a popular way to improve processor performance. One example is the IBM POWER5TM processor, a two-context simultaneous-multithreaded dual-core chip. In each SMT core, the IBM POWER5 features two levels of thread resource balancing and prioritization. The first level provides automatic in-hardware resource balancing, while the second level is a software-controlled priority mechanism that presents eight levels of thread priorities. Currently, software-controlled prioritization is only used in limited number of cases in the software platforms due to lack of performance characterization of the effects of this mechanism. In this work, we characterize the effects of the software-based prioritization on several different workloads. We show that the impact of the prioritization significantly depends on the workloads coscheduled on a core. By prioritizing the right task, it is possible to obtain more than two times of throughput improvement for synthetic workloads compared to the baseline. We also present two application case studies targeting two different performance metrics: the first case study improves overall throughput by 23.7% and the second case study reduces the total execution time by 9.3%. In addition, we show the circumstances when a background thread can be run transparently without affecting the performance of the foreground thread.Peer ReviewedPostprint (published version

    Integrating Adaptive On-Chip Storage Structures for Reduced Dynamic Power

    Get PDF
    Energy efficiency in microarchitectures has become a necessity. Significant dynamic energy savings can be realized for adaptive storage structures such as caches, issue queues, and register files by disabling unnecessary storage resources. Prior studies have analyzed individual structures and their control. A common theme to these studies is exploration of the configuration space and use of system IPC as feedback to guide reconfiguration. However, when multiple structures adapt in concert, the number of possible configurations increases dramatically, and assigning causal effects to IPC change becomes problematic. To overcome this issue, we introduce designs that are reconfigured solely on local behavior. We introduce a novel cache design that permits direct calculation of efficient configurations. For buffer and queue structures, limited histogramming permits precise resizing control. When applying these techniques we show energy savings of up to 70% on the individual structures, and savings averaging 30% overall for the portion of energy attributed to these structures with an average of 2.1% performance degradation

    Dynamically Tuning Processor Resources with Adaptive Processing

    Get PDF
    The productivity of modern society has become inextricably linked to its ability to produce energy-efficient computing technology. Increasingly sophisticated mobile computing systems, powered for hours solely by batteries, continue to proliferate rapidly throughout society, while battery technology improves at a much slower pace. In large data centers that handle everything from online orders for a dot-com company to sophisticated Web searches, row upon row of tightly packed computers may be warehoused in a city block. Microprocessor energy wastage in such a facility directly translates into higher electric bills. Simply receiving sufficient electricity from utilities to power such a center is no longer certain. Given this situation, energy efficiency has rapidly moved to the forefront of modern microprocessor design. The adaptive processing approach to improving microprocessor energy efficiency dynamically tunes major microprocessor resources—such as caches and hardware queues—during execution to better match varying application needs.1,2 This tuning usually involves reducing the size of a resource when its full capabilities are not needed, then restoring the disabled portions when they are needed again. Dynamically tailoring processor resources in active use contrasts sharply with techniques that simply turn off entire sections of a processor when they become idle. Presenting the application with the required amount of hardware—and nothing more— throughout its execution can achieve a potentially significant reduction in energy consumption. The challenges facing adaptive processing lie in achieving this greater efficiency with reasonable hardware and software overhead, and doing so without incurring undue performance loss. Unlike reconfigurable computing, which typically uses very different technology such as FPGAs, adaptive processing exploits the dynamic superscalar design approach that developers have used successfully in many generations of general-purpose processors. Whereas reconfigurable processors must demonstrate performance or energy savings large enough to overcome very large clock frequency and circuit density disadvantages, adaptive processors typically have baseline overheads of only a few percent

    ABSTRACT Tradeoffs in Power-Efficient Issue Queue Design

    No full text
    A major consumer of microprocessor power is the issue queue. Several microprocessors, including the Alpha 21264 and POWER4  ¢¡ use a compacting latch-based issue queue design which has the advantage of simplicity of design and verification. The disadvantage of this structure, however, is its high power dissipation. In this paper, we explore different issue queue power optimization techniques that vary not only in their performance and power characteristics, but in how much they deviate from the baseline implementation. By developing and comparing techniques that build incrementally on the baseline design, as well as those that achieve higher power savings through a more significant redesign effort, we quantify the extra benefit the higher design cost techniques provide over their more straightforward counterparts
    • …
    corecore